CONCEPTS CONCERNING PROTEIN SUPERFAMILIES
Introduction 

The amino acid sequences of most leucocyte surface proteins contain segments of 
sequence that have similarities to other proteins and it is likely that the similar 
sequences have been derived by divergent evolution from common precursors. 
Dayhoff et al. 1 introduced the terms "superfamily" for proteins with sequence
similarity of 50% or less and "family" for those with more than 50% identity. In
protein superfamilies there is often only 15-25% sequence identity and at this level 
it can be difficult to be confident that a sequence match indicates an evolutionary 
relationship, rather than just a chance similarity. 
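The percentage cut-off quoted above can be illustrated with a small calculation. The following is a minimal sketch, not part of any published program: the function names `percent_identity` and `dayhoff_label` are hypothetical, the input is assumed to be two sequences that have already been aligned (with `-` marking gaps), and the label is only meaningful for pairs already judged to be related.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent identity over aligned columns in which neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    # Keep only columns where both sequences have a residue ('-' marks a gap).
    columns = [(x, y) for x, y in zip(aligned_a, aligned_b) if x != "-" and y != "-"]
    if not columns:
        return 0.0
    matches = sum(1 for x, y in columns if x == y)
    return 100.0 * matches / len(columns)


def dayhoff_label(identity_percent: float) -> str:
    """Apply the 50% cut-off quoted in the text to an already related pair."""
    return "family" if identity_percent > 50.0 else "superfamily"
```

At the 15-25% identity typical of superfamilies, this number alone cannot distinguish an evolutionary relationship from chance, which is why statistical scoring and conserved patterns are also needed.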

The first superfamily of leucocyte surface proteins to be defined was the 
immunoglobulin superfamily (IgSF) and this is now the largest with more than 100 
different polypeptides on a variety of cell types 2. The sequence identities between
the members of this superfamily are at the 15-25% level but analysis showed that
the conserved residues are clustered mainly in regions corresponding to the
in-pointing residues of β strands of the Ig-fold. In contrast, the regions corresponding
to the loops at the ends of the strands mostly show great sequence diversity. It may 
be regarded as a rule that in superfamilies of sequences that have derived by 
divergent evolution the conserved residues will relate to important structural 
features that are characteristic of the superfamily in question. 

Several different superfamilies have been identified within leucocyte surface 
molecules. This chapter describes the methods for their identification and shows 
alignments of some sequences to illustrate the key residues that are often
conserved in these superfamilies. A brief description of the structure and functions 
of each domain type is given. 

Nomenclature for superfamilies, protein domains, repeats and motifs 
There is no agreed nomenclature for most superfamilies and thus in this book we 
have tried to conform to the most commonly used names and in some cases to 
introduce abbreviations that might be useful. In general where superfamilies are 
named after a receptor we use the abbreviation "R". For example, the cytokine 
receptor superfamily is called the "cytokineR" superfamily. This seems useful in 
that the name becomes distinctive and distinguishes the superfamily usage from 
discussion in which a receptor is referred to in other ways. The naming of domains 
is a problem since one might discuss Ig domains either as domains of 
immunoglobulins or as domains of the superfamily. For the superfamily usage we 
include the abbreviation "SF" in cases where there may be ambiguities. For 
example, IgSF, scavengerRSF, FN type IIISF, CCPSF.

The term "domain" is used where it is likely that a segment of sequence forms a 
discrete structural unit, i.e. a peptide sequence whose three-dimensional 
conformation is not determined by other parts of the total protein sequence but is 
"self-contained". Three criteria are considered. First, proof of a domain structure 
comes from tertiary structure determination. Domains established at this level 
include: Ig, complement control protein (CCP), EGF, fibronectin (FN) type III,
cytokineR and the C-type lectin. The MHC domain has also been revealed by X-ray 
control sequences give an occasional score above 2 SD but that no consistent
pattern of good scores is obtained. In contrast the sequences that are considered to
belong to the IgSF gave >40% scores of >3 SD. In practice, arguments for an IgSF
relationship have proved reliable in terms of subsequent tertiary structure
determination in cases where a group of scores >3 SD have been obtained with
>33% of one of the sets of IgSF sequences as defined below (i.e. the V set, C1 set or
C2 set). Convincing arguments are usually buttressed by one or two scores of >5 SD
that are unlikely to arise by chance, even in isolation. A number of the classical
sequence patterns for the superfamily in question should be present in the correct
positions in the sequence in relation to other conserved patches and the conserved
sequences should be consistent with a structural prediction for the relevant domain
in cases where the domain structure is known. Thus there should be hydrophobic
amino acids in the positions predicted to be in-pointing to stabilize the tertiary fold
and the Cys residues should potentially be able to form disulphide bonds that are
consistent with the fold. All these considerations were applied to the analysis of
the IgSF relationship of the CD2 and CD4 antigens where there has been
controversy concerning CD2 domain 1 and CD4 domain 2. In both cases it was
shown by structure determination that these domains were in the IgSF and that
correct predictions were made for the β strands in both domains 7-10.
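The scoring criteria described above (a group of scores >3 SD with more than 33% of one set of sequences, ideally supported by one or two scores >5 SD) can be written down as a simple check. This is a minimal sketch only: it does not reproduce the ALIGN program, the function names are hypothetical, and the scores are assumed to have been obtained already, expressed in SD units, for a candidate sequence against each reference set.

```python
def fraction_above(scores, threshold):
    """Fraction of scores (in SD units) strictly above the threshold."""
    return sum(1 for s in scores if s > threshold) / len(scores)


def assess_candidate(scores_by_set, min_fraction=0.33):
    """Summarize scores of one candidate against each reference set.

    scores_by_set maps a set name (e.g. "V", "C1", "C2") to a list of
    scores in SD units. A set meets the criterion when more than
    min_fraction of its scores exceed 3 SD.
    """
    report = {}
    for name, scores in scores_by_set.items():
        frac = fraction_above(scores, 3.0)
        report[name] = {
            "fraction_gt_3sd": frac,
            "n_gt_5sd": sum(1 for s in scores if s > 5.0),
            "meets_criterion": frac > min_fraction,
        }
    return report
```

A passing result is only one line of evidence; the text also requires the classical sequence patterns to be present in positions consistent with the known fold.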

A further analysis using the ALIGN program is illustrated in Table 3 for the
cytokineR superfamily. This is one of the most diverse superfamilies in terms of
sequence alignments as can be seen from Fig. 3. The ALIGN scores clearly support
the superfamily relationship for the grouped sequences with the exception of the
domains nominated for the IL4 and IL7 receptors. In the case of the IL4R domain
only 2/10 scores are 3 SD or greater with another 4 scores being >2 SD. Thus the
case for inclusion of the IL4R domain in the superfamily is weaker on the basis of
the ALIGN scores. However, inspection of the conserved sequence patterns in Fig.
3 leaves little doubt that the IL4R domain is in the cytokineR superfamily.
Cysteine residues can be confidently placed at all of the conserved positions and
other conserved patterns are also present in the correct positions. For the proposed
IL7R domain the situation is much more ambiguous. The ALIGN scores are very
weak and only a hint of the conserved sequence patterns is seen in the alignments.
This domain would not be considered for inclusion in the cytokineR superfamily
except that the domain is present in a cytokine receptor. The case for this domain
will ultimately require validation by three-dimensional structure determination.

Domain sequence and structure: divergent and convergent evolution 

In the above section criteria for defining a superfamily have been based on
identifying a sequence pattern that is shared in a non-trivial way between
sequences of different molecules. It is then argued that the presence of the
sequence pattern indicates a relationship in evolution such that the domains that
share the sequence pattern both derive from one original primordial domain.
However, it could be argued that a certain structure dictates a sequence pattern and
the sharing of the pattern is due to convergent evolution from different precursor
molecules rather than divergent evolution from a primordial domain. Conversely it
may be found that sequences with no detectable common pattern form similar
tertiary structures and thus that these are in the same superfamily even though
there is no detectable sequence relationship.

It now seems very unlikely that a general structure will dictate a unique
sequence pattern. This can be seen from a consideration of sequences that can give
rise to domains with the Ig-fold. There are now five different sets of sequences with
no convincing sequence similarity between them that can all give rise to the
Ig-fold. These are the sequences of the Ig superfamily, the FN type IIISF and
cytokineRSF plus two sequences in the PapD bacterial protein that each give rise
to a domain with an Ig-fold. There is also enormous diversity of sequence
within the IgSF that leads to the argument that there is no unique sequence
required to determine any part of the IgSF-fold. Thus it seems rather unlikely that
convergent evolution to yield the same structure would give rise to any common
pattern.

The converse argument is that all the sequences that give the same folding
structure have derived by divergent evolution and that all sequences with this
structure should be included in the same superfamily. For example, the five sets of
sequences referred to above might all be considered as IgSF sequences. It does not
seem useful to take this point of view since there may be a relatively limited
number of small stable protein folds that can occur and these may have evolved on
numerous occasions in evolution. In this case each of the sets of sequences with
the Ig-fold would have an independent primordial ancestor. Alternatively, there
may have been one primordial structure which acquired mutations such that a new
solution to the structure was produced, ultimately giving rise to sequences that
were not detectably similar to the ancestor family of sequences. At this stage there
is no way to estimate the probability of the divergent versus the convergent [...]
for generation of the same structure without recognizable sequence similarity and
it seems best to stick to sequence patterns as the criteria for defining superfamilies.
This is sensible from a practical as well as a theoretical standpoint since sequence
data are much more readily obtained than tertiary structural data and the
superfamilies defined on the basis of sequence would be grouped as subsets within
superfamilies based on tertiary structure considerations. It seems better to retain
the sequence criterion and to note that certain superfamilies have the same folding
patterns in their domains.

Given that the same structure can arise from various sequences the question
arises as to why sequence patterns are conserved in evolution. Molecules on the
cell surface present unique determinants for interaction with a soluble molecule,
the extracellular matrix or with other cell surface receptors. Such interactions
require diversity between molecules and not conservation of epitopes. The
sequence patterns shared within a superfamily conserve the fold of the molecule
and usually involve residues pointing inwards in the folded structure rather than
out-pointing residues that are available for biological interactions. Thus the
question arises as to what evolutionary force can operate to preserve the tertiary
structure of the molecule?

For cell surface molecules it can be argued that the key evolutionary pressure is
the requirement for molecular stability and, in particular, resistance to proteolysis.
The small, tightly folded domains that make up most of the leucocyte molecules
may have evolved as parts of stable coat proteins on single cell eukaryotes [...]. These
coat proteins then gave rise to the families of molecules that evolved along with
the evolution of multicellular organisms, to mediate cell division and regulation of
cell differentiation. Surface molecules are generally resistant to proteolytic
enzymes and this resistance is based on the folded structure, since denatured
molecules are easily digested. One could argue that mutation to give new
recognition epitopes would be constrained by the necessity of preserving the
structure of the domain. In general this led to preservation of certain sequence
patterns that determine one particularly stable solution for the fold. Although
alternative sequence patterns may exist that could also give a stable fold, to
reach these a number of simultaneous mutations may be required and the
switch to a new pattern may be a rare event in evolution. If a new pattern arises
this may become the founder of a new set of sequences in which the new pattern is
retained, again because of the pressure of proteolysis. From this viewpoint it is
likely that the Ig, FN type III and cytokineR superfamilies all arose from a common
ancestor via sequence shifts as described above. This view might be supported
because domains of these superfamilies are found in molecules with [...]
functions and often a molecule may contain both Ig superfamily domains and
domains of the FN type III and cytokineR superfamilies. In particular, IgSF and FN
type IIISF domains are often found together in a single polypeptide.

Genomic structure and evolution of proteins with mixtures of domain types

The number of domains in a cell surface protein can vary greatly. In the [...]
antigen there is a single IgSF domain making up the whole of the extracellular
segment, whilst for the complement receptor 1 protein (CD35) the extracellular
region consists of 30 CCPSF domains in a linear array. In these cases only one
domain type is present but in other molecules there can be admixture of domain
types. For example the L-selectin (LECAM-1) antigen contains C-type lectin,
EGFSF and CCPSF domains.

The efficient build-up of proteins from individual domains during evolution
appears to depend on two aspects of genomic structure. There should be
approximate concordance of the domain ends with intron/exon boundaries and the
position of the intron with respect to the reading frame of a gene should be such
that an open reading frame results from the recombination of an exon into an
intron of an existing sequence 15. Introns that are inserted after the first base of a
codon are called phase 1, those after the second base, phase 2, and those between
codons, phase 0 16. Analysis of the intron/exon boundaries of domains present in
leucocyte surface molecules shows that for most domain types each exon boundary
is of the same phase (usually 1) as illustrated in Table 4. Recombination of such
exons will lead to the construction of new open reading frames. The domain does
not need to be contained within a single exon to allow shuffling as long as the
outermost intron boundaries are compatible. For instance, some IgSF domains are
coded for by two exons 2 and the cytokineRSF domain in the IL2 receptor [...]
by three exons. In the latter case the internal splice sites are phase 2 and [...]
whilst the external ones are phase 1; thus it is not possible to get part of this
domain integrated into a sequence containing phase 1 splice sites.
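The phase rule for exon shuffling can be stated compactly in code. The sketch below is illustrative only: the function names are hypothetical, and it assumes that intron phase is computed from the number of coding bases that precede the intron, following the definition given in the text (phase 0 between codons, phase 1 after the first base, phase 2 after the second base).

```python
def intron_phase(coding_bases_before_intron: int) -> int:
    """Phase of an intron given the count of coding bases upstream of it.

    0: the intron lies between two codons.
    1: the intron follows the first base of a codon.
    2: the intron follows the second base of a codon.
    """
    return coding_bases_before_intron % 3


def insertion_keeps_frame(phase_5prime: int, phase_3prime: int,
                          target_intron_phase: int) -> bool:
    """Check whether an exon unit can be inserted into an existing intron.

    phase_5prime and phase_3prime are the phases of the introns flanking
    the exon (or group of exons) being moved. The reading frame is kept
    on both sides only if both flanking phases equal the phase of the
    intron that receives the insertion.
    """
    return phase_5prime == phase_3prime == target_intron_phase
```

For a domain split over several exons only the outermost boundaries enter this check, which is why a fragment bounded by an internal splice site of a different phase cannot be moved into a phase 1 intron without breaking the frame.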
THE SUPERFAMILIES THAT ARE FOUND IN LEUCOCYTE CELL
SURFACE MOLECULES

The superfamilies that are present in leucocyte surface molecules are discussed
below together with alignments of some of the domains or repeat sequences. The
alignments were made using a variety of computer programs (ALIGN [...], [...] 17,
PILEUP 18) and then modified after visual examination. The ends of the domains
Table 4. Exon organization of domains and repeats of leucocyte surface molecules

Domain or repeat type                         Do the domain boundaries   Splice site   Usual number
                                              coincide with introns                    of exons per
                                              with same splice sites?                  domain

Complement control protein (CCP)SF            Yes                        type 1        1
CytokineRSF                                   Yes                        type 1        2
EGFSF                                         Yes                        type 1        1
Fibronectin type IISF                         Yes                        type 1        1
Fibronectin type IIISF                        Yes                        type 1        1
IgSF                                          Yes                        type 1        1 or 2
Lectin C-typeSF (e.g. selectins)              Yes                        type 1        1
Lectin C-typeSF (e.g. Kupffer cell receptor)  No                         NA            3
Lectin S-typeSF                               NK                         NK            NK
Leucine-rich glycoprotein repeat              No                         NA            NA
LinkSF                                        Yes                        type 1        1
LDLRSF                                        Yes                        type 1        1
Ly-6SF                                        No                         NA            NA
MHC                                           Yes                        type 1        1
Nerve growth factor receptor (NGFR)SF         No                         NA            NA
ScavengerRSF                                  NK                         NK            NK
Somatomedin BSF                               Yes                        type 1        1

In both the IgSF and the CCPSF domains there are examples where the domain is encoded by
two exons and also where two domains are encoded by one exon. Only limited data are
available on some of the domains and it is possible that other examples with different
numbers of exons per domain or motif may be found.

NK, not known; NA, not applicable.



can be difficult to define from the sequence and this problem is illustrated by
consideration of the structure for CD4. In CD4 the last β strand of domain 1
continues directly into domain 2 and between CD4 domains 3 and 4 it seems
highly likely that the overlap will be even greater and that the last β strand of
domain 3 will also be the first β strand of domain 4. Thus in the alignments shown,
the domains are defined with respect to key internal residues that are marked with
an asterisk, and the beginnings and ends can be taken for statistical comparisons as
being a constant number of residues before and after the conserved positions. For
example, in the case of the IgSF this is taken as 20 residues before and after the
conserved Cys positions. If the goal was to express a single domain in an expression
system then sequence alignments and structure should be taken into account and a
structural prediction would be attempted on the basis of all the data to decide on
the sequence that should be expressed.
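The fixed-flank convention for statistical comparisons can be sketched as follows. This is an illustration under stated assumptions rather than the authors' procedure: the function name `domain_window` is hypothetical, positions are 0-based indices into the sequence string, and the two conserved Cys positions are assumed to be known from the alignment.

```python
def domain_window(sequence: str, first_cys: int, second_cys: int, flank: int = 20):
    """Return (start, end, segment) for a domain defined by conserved Cys residues.

    The segment runs from `flank` residues before the first conserved Cys
    to `flank` residues after the second, clipped to the sequence ends.
    `start` is inclusive and `end` is exclusive (Python slice convention).
    """
    if first_cys >= second_cys:
        raise ValueError("first_cys must precede second_cys")
    if sequence[first_cys] != "C" or sequence[second_cys] != "C":
        raise ValueError("expected Cys (C) at both conserved positions")
    start = max(0, first_cys - flank)
    end = min(len(sequence), second_cys + flank + 1)
    return start, end, sequence[start:end]
```

Such a window is a convenience for comparing sequences; as the text notes, choosing a construct for expression calls for structural information as well.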
The complement control protein (CCP) superfamily (Figs 1 and 2)
This domain is named CCP because it is commonly found in proteins that control 
the complement cascade 19 . For instance, factor H consists solely of 20 CCPSF 
domains whilst other complement components contain CCPSF domains mixed 
with other domains, e.g. factors B and C2 each contain three CCPSF domains
together with a serine protease domain. The CCP domain is also commonly called 
the short consensus repeat or SCR 19 . It is present in widely different numbers in 
cell surface molecules ranging from 30 domains in complement receptor 1 (CD35) 
to a single domain in L-selectin. These domains are clearly involved in protein 
binding and the CR1 (CD35) and CR2 (CD21) complement binding regions have 
been mapped to the first four CCP domains of each of the first three groups of 
seven domains in CD35 and to the first two domains of CD21. 

The structure of one CCPSF domain from complement control protein factor H 
has recently been solved using NMR and consists of two segments of antiparallel β
sheet and a short triple-stranded β sheet with no α-helical structure 20. The folding
pattern for this domain is shown in Fig. 2 and the β strand positions are marked
above the sequence alignments shown in Fig. 1.

Cytokine receptor (cytokineR) superfamily (Figs 3 and 4) 

Three domain types are found in cytokine receptors including those of the Ig, FN
type III and cytokineR superfamilies. A common arrangement is to have a single
NH2-terminal cytokineRSF domain followed by an FN type IIISF domain, but there
are variations on this theme. Initially these two domain types were not
distinguished 5 and the term haematopoietin receptor superfamily was widely used
for molecules containing this pair of domain types 21,22. We use the term cytokine
receptor superfamily for the domain of about 100 amino acids usually found
NH2-terminal to the FN type IIISF domain and alignments of domains from this
superfamily are shown in Fig. 3. Analysis with the ALIGN program (Table 3) gives
good evidence for the presence of the cytokineRSF domain in the receptors for IL2
(β chain), IL3, IL6, growth hormone, granulocyte-macrophage colony stimulating
factor, erythropoietin and in the GMP130 protein 22. The presence of a
cytokineRSF domain in IL4R is less strongly supported by ALIGN analysis but as
discussed above the case for inclusion of this domain is convincing if all the data
are considered. The IL7 receptor contains a clear FN type IIISF domain but the
sequence at the NH2-terminal region shows only a distant relationship to the
cytokineRSF domains (see p.338). The possible cytokineR domain in the IL7 receptor
22,23 is shown below the other sequences in Fig. 3 but the correctness or otherwise of
this assignment will require validation by tertiary structure determination.

The structure of the growth hormone receptor has recently been solved by X-ray
crystallography 13 and this has revealed the fold for the cytokineRSF and the FN
type IIISF domains that constitute the extracellular domain of this receptor (an FN
type IIISF domain has also been solved by NMR - see below). These domains have
similar folds that are also similar to the folds of IgSF C2 set domains 8,9 and the
PapD chaperone protein domains 11. Bazan 24 had previously argued that there may
be structural similarities between cytokineRSF domains, FN type IIISF domains
and IgSF domains on the basis of predicting patterns of β strands in the sequences.
Despite the success of these predictions the degree of sequence similarity between
these domain types is low. The cytokineRSF domains have a characteristic
Cys-X-Trp sequence together with three other conserved Cys residues, whilst the FN type
Figure 1. CCP superfamily domains. Residues identical in four or more sequences
are boxed. The lines above the sequences correspond to the positions of the β
strands determined from the structure of factor H domain 16, residues 929-986
(see Fig. 2) 20. The asterisks mark the positions of the conserved residues that are
shown on the Figures to identify domains in each entry for a molecule as shown in
the entries in Section II. The sequences of the following proteins are from the
Swissprot database unless otherwise indicated and the database accession
number and residue numbers are given in brackets. Factor H, human complement
factor H precursor domain 16 (P08603, 929-986); CD35 d12, complement receptor
1 precursor domain 12 (P17927, 745-799); Factor B, HR16 human complement
factor B (P00751, 10-74); L-selectin, L-selectin precursor (P14151, 195-25[...]);
C4BPA, complement C4-binding protein (P04003, 249-313); IL2Rl, interleukin 2
receptor α chain precursor (P01589, 22-83); FXIII, coagulation factor XIII [...]
precursor (P05160, 452-516).
Figure 3. (opposite) CytokineR superfamily domains. Residues identical in four or
more sequences are boxed. The asterisks mark the positions of the conserved
residues that are shown on the figures to identify domains in each entry for a
molecule as shown in Section II. The sequences of the following proteins are from
the Swissprot database unless otherwise indicated and the database accession
number and residue numbers are given in brackets. GHR, human growth hormone
receptor precursor (P10912, 46-147); PLR, rat prolactin receptor precursor (P05710,
21-116); GMP130, human membrane glycoprotein gp130 precursor (PIR: A36337,
124-218); EPOR, mouse erythropoietin receptor precursor (P14753, 42-140); IL3R,
mouse IL3 receptor precursor domains 1 and 3 (PIR: A35782, d1, 29-127; d3,
243-347); GMCSFR, human GM-CSFR precursor (P15509, 116-214); IL6R, human
IL6 receptor precursor (P08887, 112-214); IL2Rb, human IL2 receptor β chain
precursor (P14784, 26-125); IL4R, mouse IL4 receptor precursor (P16382, 24-122);
IL7R, human IL7 receptor precursor (P16871, 32-127). The sequence alignments
are from 20 amino acids NH2-terminal from the conserved CXW. The sequence
start corresponds to residue 2 in the prolactin receptor. The COOH-terminus is
more difficult to define due to the lack of conserved residues and that shown is
close to the predicted boundary between the cytokineRSF domains and the FN
type IIISF domains in GHR, PLR, IL6R.




Figure 4. The folding patterns of the cytokineR superfamily and fibronectin type
IIISF domains. Ribbon diagrams for the cytokineR superfamily and FN type IIISF
domains from human growth hormone receptor 13, and FN type IIISF domain 21
from human fibronectin 14. The IgSF C2-set domain from human CD4 domain 2 is
included for comparison 8,9. The β strands are shown as broad arrows pointing
from the amino to carboxy direction and the connecting loops as thinner lines.
Some gaps are present in the loops of the growth hormone receptor where the
structure has not been fully resolved 13. Each β strand is labelled using the same
nomenclature as in the IgSF. This lettering corresponds to that in the sequence
alignments (Figs 3, 8, 12).
IIISF domains lack a conserved pattern of Cys residues. Using the ALIGN program
100 comparisons between cytokineRSF and FN type IIISF domains were made and
of these 81 gave a score less than 2 SD and only eight comparisons gave scores
of 3.0 SD or greater (5.9, 5.6, 4.2, 3.7, 3.5, 3.4, 3.3, 3.0 SD; unpublished
observations). Although there were two good scores out of 100 and some moderate
ones, the case for a relationship in evolution based on ALIGN analysis is much
weaker than that for members of the individual superfamilies. The possibility of an
origin by divergent evolution for the cytokineR, FN type IIISF and IgSF domains is
discussed above.
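The counts quoted above can be checked with a short tally. This is a minimal sketch using only the figures given in the text (100 comparisons, 81 below 2 SD, and the eight listed scores); the function name and the 5 SD cut-off for a "good" score are assumptions taken from the criteria discussed earlier in the chapter.

```python
# The eight highest scores (in SD units) quoted in the text.
REPORTED_TOP_SCORES = [5.9, 5.6, 4.2, 3.7, 3.5, 3.4, 3.3, 3.0]


def summarize_comparisons(total: int, below_2sd: int, top_scores):
    """Break the comparisons down into score bands."""
    at_least_3 = sum(1 for s in top_scores if s >= 3.0)
    above_5 = sum(1 for s in top_scores if s > 5.0)
    # Whatever is neither below 2 SD nor at least 3 SD lies in between.
    between_2_and_3 = total - below_2sd - at_least_3
    return {
        "below_2sd": below_2sd,
        "between_2_and_3sd": between_2_and_3,
        "at_least_3sd": at_least_3,
        "above_5sd": above_5,
        "fraction_at_least_3sd": at_least_3 / total,
    }
```

With the quoted figures only 8% of comparisons reach 3 SD, well short of the proportion expected within a single superfamily.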

Several members of the cytokineR superfamily show small patches of sequence 
similarity in their cytoplasmic domains. These are reviewed in ref. 25. 

Epidermal growth factor (EGF) superfamily (Figs 5 and 6) 

EGFSF domains are found in EGF itself and in transforming growth factor (TGF) α.
This domain is also found in a variety of secreted proteins such as blood
coagulation factor IX and cell surface molecules such as in the selectins L-selectin,
E-selectin and P-selectin (CD62). The structures of EGF, TGFα and the factor IX
EGFSF domain have recently been determined and show similarity in folding
pattern 26-28. The structures of EGF and factor IX EGFSF domains are shown in Fig.
6. The latter is slightly smaller than EGF itself but is probably representative of the
repeating EGFSF domains found in many proteins (see Fig. 5). The single EGFSF
domain from factor IX has functional activities distinct from the EGF itself, for
example it has Ca2+-binding activity 28. It is likely that EGFSF domains are
Figure 5. EGF superfamily domains. Residues identical in five or more sequences
are boxed. The asterisks mark the positions of the conserved residues that are
marked on the domain organization figures in the entries in Section II. The
sequences of the following proteins are from the Swissprot database and the
database accession number and residue numbers are given in brackets. FA9-1 and
FA9-2, human coagulation factor IX precursor (P00740, 92-130 and 131-172); EGF,
human epidermal growth factor precursor (P01133, 971-1014); L-Sel, human
L-selectin precursor (P14151, 157-193); CD62, human CD62 or P-selectin precursor
(P16109, 160-196); E-Sel, E-selectin precursor (P16581, 138-176); PRTC, human
protein C precursor (P04070, 96-133); 114/A10, mouse haematopoietic cell surface
protein 114/A10 precursor (P19467, 232-274); NOTCH, Drosophila notch protein
(P07207, 1021-1059). The ends of the alignment correspond to those of the
coagulation factor IX EGFSF domain whose structure has been determined 28. The
structure of EGF itself has been determined for a sequence that extends a further
four residues beyond that shown (see Fig. 6) 27. The bars above the sequence
indicate the positions of the β-strands.
Figure 6. The folding pattern of EGFSF domains. Ribbon diagrams of EGF 27 and a
coagulation factor IX EGFSF domain 28. The β strands are shown as broad arrows
pointing from the amino to carboxy direction and the connecting loops as thinner
lines. The NH2-terminal core of the structure is similar in both domains but the
EGF structure extends further with two more short β strands.
Figure 7. Fibronectin type IISF domains. Residues identical in three or more
sequences are boxed. The asterisks mark the positions of the conserved residues
that are marked on the domain organization figures in the entries in Section II.
The sequences of the following proteins are from the Swissprot database and the
database accession number and residue numbers are given in brackets. Fibr,
human fibronectin precursor (P02751; d1, 314-373; d8, 374-434); Mannose R,
human mannose receptor precursor (P22897, 153-212); Factor XII, human
coagulation factor XII precursor (P00748, 32-91); Collag, human type V
collagenase precursor (EC 3.4.24.7) (P14780, 332-391). The ends of the alignment
are based on the exon boundaries of the fibronectin domains.



recognition structures that can be involved in various functions. For example, two
of the 36 EGFSF domains of the Drosophila protein Notch have been shown to be
necessary for the interaction of Notch with the Delta and Serrate proteins 29.

Fibronectin (FN) type II superfamily (Fig. 7) 

The FN type II domains were first identified as one of three different repeating
sequence patterns within the fibronectin molecule. The FN type IISF domain has
been found in few other proteins and the only leucocyte molecule with this domain
is the mannose receptor which contains one FN type IISF domain. The structure of
a sequence from bovine seminal fluid protein PDC-109 that shows sequence
similarity over part of the FN type II domain alignment shown in Fig. 7, has been
determined by NMR 30.
Fibronectin (FN) type III superfamily (Figs 4 and 8)
The FN type IIISF domain was identified in an extracellular matrix protein but is
also common in membrane molecules and particularly those found in the nervous
system which often have IgSF domains 2. It has also been found in large numbers in
the group of muscle proteins that bind myosin such as twitchin in Caenorhabditis
elegans 31 and also in titin in mammals. This is the only group of IgSF molecules
found so far in the cytosol. Another example of cytosolic localization of FN type
IIISF domains is in the cytoplasmic segment of the integrin β4 chain 33 (note the
external regions of integrins do not contain any FN type IIISF domains). This is
currently the only example of a domain found at the surface of leucocytes which is
also present on the cytoplasmic side of a transmembrane protein.

Structures for FN type IIISF domains have recently been solved by NMR 14 and
X-ray crystallography 13. This domain consists of two β sheets with a similar
folding pattern to the IgSF fold, the cytokineR domain and the domains of the
PapD chaperone protein 11. However, there is no significant sequence similarity
amongst these proteins as analysed by the methods discussed above.

Immunoglobulin (Ig) superfamily (Figs 9-12) 

The immunoglobulin superfamily (IgSF) is the largest superfamily of cell surface
proteins in general and for leucocyte antigens in particular, as is evident from the
collated data in Table 1 in Chapter 1 which shows that approximately 40% of
leucocyte membrane polypeptides contain IgSF domains. The structures of several
IgSF domains have been determined by X-ray crystallography including Ig V- and C-



Figure 8. (opposite) Fibronectin type IIISF domains. Residues identical in five or
more sequences are boxed. The positions of the β strands determined for domain
21 of human fibronectin are indicated above the sequences 14. See Fig. 4 for
folding patterns of FN type IIISF domains from fibronectin and growth hormone
receptor. The asterisks mark the positions of the conserved residues that are
marked on the domain organization figures in the entries in Section II. The
sequences of the following proteins are from the Swissprot database unless
otherwise indicated and the database accession number and residue numbers are
given in brackets. GHR, human growth hormone receptor precursor (P10912,
148-251); FIBR, human fibronectin precursor (P02751: d12 605-700, d13 719-80[...],
d16 996-1085, d21 1447-1541); LAR, human LAR precursor (P10586, 596-694);
TWIT, twitchin cytoplasmic protein from Caenorhabditis elegans (PIR: S07571,
1761-1854); CAML1, mouse neural adhesion molecule L1 precursor (P11627,
916-1012); IL7R, human interleukin 7 receptor precursor (P16871, 128-231);
GMP130, human membrane glycoprotein gp130 precursor (PIR: A36337, 221-32[...]);
PLR, rat prolactin receptor precursor (P05710, 121-224); IL3LR, mouse
IL3-receptor-like protein precursor (AIC2B) (PIR: A35782, d2, 135-243; d4, 342-441).
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Figure 9. The folding pattern of IgSF domains. Ribbon diagrams for IgSF domains. 
Ig Vset (V H of human NEW Fab); Ig CI set ($ 2-microglobulin) ; Ig C2 set (CD4 
domain 2); and Ig Vset lacking the normally conserved disulphide between p 
strands B and F (rat CD2 domain 1). The$ strands are shown as broad arrows 
pointing from the amino to caiboxy direction and the connecting loops as thinner 
lines. These are labelled with the corresponding strand letters used in the 
alignments for the Ig V-set, Cl-set and C2-set sequences (Figs 10-12) and in the 
FN typelllSF domains (Figs 4 and 8). The data are from the Brookhaven Protein 
Structure Database apart from CD2 10 . 



Figuie 10. (opposite} Immunoglobulin V-set domains. Residues identical in five or 
more sequences are boxed. The positions of the $ strands are indicated above the 
sequences. The asterisks mark the positions of the conserved residues that are 
marked on the domain organization figures in the entries in Section II. The 
sequences of the following proteins are from the Swissprot database and the 
database accession number and residue numbers are given in brackets. Ig lambda, 
mouse Ig X chain precursor (MOPC 104E) (P01724, 21-129)*, Ig kappa, human Ig k ' 
chain Roy (P01608, 3-107); Ig heavy, human Ig heavy chain NEWM(P01825, 
3-116); TcR beta, human TcR p chain precursor (P01 733, 22-135); TcR alpha, mouse 
TcR a chain precursor (P01 739, 23-132); CD8 beta, rat CD8& chain precursor 
(P05541, 21-134); CD8 alpha, rat CD8 a chain precursor (P07725, 27-138); CD4dl, 
human CD4 precursor domain 1 (P01730, 21-123); Thy-1, rat Thy-1 precursor 
(P01830, 18-128); CD2 dl, rat CD2 precursor domain 1 (P08921, 20-120). 
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Figure 11. (opposite) Immunogl 
or mote sequences are boxed, 5 
the sequences. The asterisks m 
marked on the domain organic 
sequences of the following pro\ 
otherwise indicated and the dt 
given in brackets. Ig Lambda f | 
Kappa, human Ig k chain C re) 
region (P01857, 230-329); TcJ^ 
2M, human p 2-microglobulhr^ 
MHC Class I HLA a chain pr| 
// d2, human MHC Class II D) 

113-209). | 

s 

Figure 12. (opposite) Immundg 
or more sequences are boxed J 
the sequences. The asterisks | 
marked on the domain orgarh 
sequences of the following pn 
otherwise indicated and the a 
given in brackets. CD4d2, hva 
NCAM-Lld3, mouse neural c\ 
(P11627, 243-331); MAGd4, is 
4 (P07722, 327-412)-. Amalgaj 
3 (P 15364, 231-327); NCAM,| 
(P13590, 203-295); CEA d4, fi 
(P06731, 321-410); IgFcRII, n\ 
37-116); CD2d2, human CD2 
hum_an CD3z precursor (P077. 



domains, ^-microglobulin 
and 2 8 - 9 and CD8a 35 . Tri 
determined by NMR 10 . Thes 
by sequence similarities over 
with distinct folding patterns 
fold consists of a sandwich oj 
of 5-10 amino acids with a| 
but not all domains. The seq; 
in-pointing residues in the 
connect the strands and the| 
core of the fold is made ut 
positioning of these is show 
vary considerably in length | 
being the archetype for the j 
domains forms an addition 
connection between these f( 
in antibody and TcR V-doma 



marked on the domain organization figures in the entries m Section II. The 
Terences of the following proteins are from the Swissprot database unless ^ 
oT^eindJted and the database — , ^^J^SfT 
riven in brackets. Ig Lambda, human Ig X chain C region (P01842, / iw),ig 
Zpah^anlg /chain C region (P01834, 6-106); IgG «^^* ^ 
JL (P01BS7, 230-329) : T cR ^^^^^^^^ 
2M, human 0 2-microglobulm precursor M hc class 

MHC Class I HLA a chain precursor domam 3 ^• A02J8 ^^n2206 
U d2 human MHC Class II DR a chain precursor domam 2 (PIR.A02206, 



113-209). 



Pi™re 12 (oDDOsite) Immunoglobulin C2-set domains. Residues identical in four 

otherwise indicated and tie dxahua accession SIS. 
-gfra in backets. CD4d2. -human CD4 precursor domain 2 W 1 ™**??" 1 ' 
NCAM-Lld3 mouse neural cell adhesion molecule U precursor domain 3 
S 7 JSJmi, MAGd4,rat myelin associated glycoprotein precursor domam 
ffiSl Amalgam d3. Drosophila amalgam protein precursor domam 
a ffXSSM NC^. chicken neural cell adhesion molecule precursor 

S d4. human carcmoembryomc ant*en <*™~*> 4 

(P06731. 321-410); IgFcRII, mouse JgG ^^"JoTcm epsilon 
37-116); CD2d2, human CD2 precursor domam 2 (P067&, idf-dw), ^vo * y 
human CD3z precursor (P07766, 29-117). - - - ...._„.— .-_ 



domains, ^-microglobulin », MHC Class I antigen a3 

„.H ■) «,» and CD8a 35 The structure of domain 1 of CD2 has recently dcc 
oelrniLdW NMR «• These structures show that the IgSF domains character** 
b~ce Sties over about 100 amino aci ™ ™ *™ 
with distinct folding patterns referred to as the Ig-fold |rev.ewed m ref 12]. Thel, 
foW consists of asandwich of two P sheets, each consisting of antiparalle p stran< 
of tToTnSio acids with a conserved disulphide between die two sheets in mo 
out not StoSi The sequence similarities are 

in-pointing residues in the f! strands with "^.^^S $ 

in antibody and TcR V-domains. 
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In the IgSF there are limited sequence patterns in p strands B, C, E, and F that are 
common across the superfamily (Figs 10- 12) and other limited patterns that allow a 
subdivision of the domains. Ig and TcR V-domains have a characteristic pattern in 
the region leading into p strand F of Asp-X-Gly/Ala-X-Tyr-X-Cys. The receptor C- 
domains have a characteristic pattern between p strands B and C of GlyPheTyrPro 
and another on the COOH-terminal side of p strand F of Cys-X-Val-X-His. The Ig, 
TcR and MHC antigen C-type domains all share the same types of sequence 
patterns and are referred to as the CI set within the IgSF. With the sequencing of 
various cell surface molecules a third category of domains became evident, namely 
domains of length similar to C-domains but with some of the sequence patterns of 
V-domains. These domains are referred to as the C2 set 2 . They have V-type 
patterns in the p strand E to F region, a pattern of Pro-X-Pro is relatively common 
between p strands B and C and the pattern Cys-X-Ala-X-Asn is common after p 
strand F. CD4 domain 2 is a C2-set sequence and its structure is classified in terms 
of sheet assignments labelled as ABE/GFCC. This is in comparison to ABED/GFC 
for Cl-set sequences and ABED/GFCC'C" for V-set sequences. That is, for C2-set 
sequences the middle p strand may be generally in line with the GFC p sheet rather 
than the ABE sheet as is the case with antibody C-domains. The points about 
conserved patterns and the positioning in p sheets are made evident by comparing 
the sequence alignments in Figs 10-12 with the folding patterns for the domains in 
Fig. 9. The sequence alignments are discussed in more detail in ref. 2. 
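
The residue patterns quoted in this section lend themselves to a simple text search. The following Python sketch is illustrative only: it translates the three-letter patterns given above into regular expressions and reports where they occur in a one-letter sequence. The function names and the toy sequence are invented for the example. A hit merely flags a candidate position and is no substitute for the alignment-based analysis described earlier in the chapter.

```python
import re

# Standard three-letter to one-letter amino acid codes.
THREE_TO_ONE = {
    "Ala": "A", "Arg": "R", "Asn": "N", "Asp": "D", "Cys": "C",
    "Gln": "Q", "Glu": "E", "Gly": "G", "His": "H", "Ile": "I",
    "Leu": "L", "Lys": "K", "Met": "M", "Phe": "F", "Pro": "P",
    "Ser": "S", "Thr": "T", "Trp": "W", "Tyr": "Y", "Val": "V",
}

def pattern_to_regex(pattern):
    """Convert e.g. 'Asp-X-Gly/Ala-X-Tyr-X-Cys' to the regex 'D.[GA].Y.C'."""
    parts = []
    for position in pattern.split("-"):
        if position == "X":
            parts.append(".")  # X stands for any residue
        else:
            letters = [THREE_TO_ONE[name] for name in position.split("/")]
            if len(letters) == 1:
                parts.append(letters[0])
            else:
                parts.append("[" + "".join(letters) + "]")
    return "".join(parts)

# Patterns as quoted in the text; the labels are shorthand for this example.
IGSF_PATTERNS = {
    "V-domain, leading into strand F": "Asp-X-Gly/Ala-X-Tyr-X-Cys",
    "C1 set, between strands B and C": "Gly-Phe-Tyr-Pro",
    "C1 set, after strand F": "Cys-X-Val-X-His",
    "C2 set, between strands B and C": "Pro-X-Pro",
    "C2 set, after strand F": "Cys-X-Ala-X-Asn",
}

def find_patterns(sequence):
    """Return (label, 1-based start, matched text) for every pattern hit."""
    hits = []
    for label, pattern in IGSF_PATTERNS.items():
        for match in re.finditer(pattern_to_regex(pattern), sequence):
            hits.append((label, match.start() + 1, match.group()))
    return hits

# Invented toy sequence, not a real protein.
toy = "MKTLSGFYPAAAEDSAVYYCARGG"
for hit in find_patterns(toy):
    print(hit)
```

Short patterns such as Pro-X-Pro will match by chance in many unrelated sequences, so in practice the position of a hit relative to the predicted β strands matters as much as the hit itself.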

Integrin superfamily (Figs 13 and 14)

The integrins are a large family of related proteins that all share a heterodimeric
structure with α and β chains that both traverse the lipid bilayer. There are at least
20 α and eight β chains which can be found in various but not all possible
combinations. Sequence similarities are seen within the α and β chains across all
the integrin types (Figs 13 and 14). The integrins are known to be involved in cell
interactions and include receptors for the extracellular matrix proteins fibronectin
and vitronectin and for cell surface molecules ICAM-1 and ICAM-2. The integrins
have been extensively reviewed elsewhere including a companion volume in this
FactsBook series 36. They are expressed on many different cell types; the
CD11/CD18 family and the CD49 very late activation antigen family (VLA) are
expressed mainly on leucocytes. The sequence similarities in this family are
described in more detail in refs 37-39.

This family of related proteins does not contain other domain types apart from
the β4 integrin that contains two FN type III SF domains in the cytoplasmic region
(see ref. 33 and section on fibronectin type III SF domains).

Figure 14. (opposite) Integrin β chains. Residues identical in 3 out of 3 of the
sequences are boxed. The sequences of the following proteins are from the
Swissprot database and the database accession number and residue numbers are
given in brackets. Beta 1, human fibronectin receptor, integrin β1 precursor
(P05556, 26-752); Beta 2, human integrin β2 (CD18) precursor (P05107, 024-724);
Beta 3, human integrin β3 (CD61) precursor (P05106, 30-742). The extracellular
and transmembrane regions are shown.

Lectin C-type superfamily (Figs 15 and 16)

This family of lectin domains is termed C-type because some members have been
shown to bind carbohydrate in a Ca2+-dependent reaction 40. This domain has been
found in a number of lectins such as the Kupffer cell fucose/galactose receptor,
hepatocyte galactose receptor, mannose binding protein from plasma, and galactose
binding proteins in two invertebrate species, the flesh fly and sea urchin.
Lectin C-typeSF domains are found in a number of proteins not originally known to
bind carbohydrate, such as the proteoglycan core protein 44, an endothelial
leucocyte adhesion molecule (E-selectin), and leucocyte cell surface antigens such
as L-selectin and the low affinity Fc receptor for IgE (CD23). In some cases
carbohydrate binding for the lectin C-typeSF domain has been established, e.g.
cartilage proteoglycan core protein 45.

Two groups of lectin C-type domains can be distinguished. The L-selectin has the
lectin C-typeSF domain plus about 10 residues of the signal sequence contained
completely within one exon with phase 1 intron boundaries 46. In cases other than
the selectins, the lectin domain is usually found spread over three exons which also
include the COOH-terminus of the protein and the 3' untranslated sequence 47. In
the genetic region encoding the NH2-terminal side of the exon there is a phase 1
intron boundary and thus if this exon was inserted into an intron of another gene it
might function to produce a new protein with the lectin domain at the COOH-
terminus. If the insertion occurred after a hydrophobic region in a Type I
membrane protein the lectin exon would be cytoplasmic, and thus presumably of
little functional relevance. Conversely if it were inserted in a Type II protein it
would form a new extracellular COOH-terminal region with carbohydrate binding
properties. The recent cDNA sequence of the macrophage mannose receptor 48 is
the first example of a protein containing multiple lectin repeats, with eight in all.
There is no information as yet on its genomic structure.
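
The reading-frame argument behind this kind of exon insertion can be made concrete. The sketch below is an illustration under stated assumptions rather than a method taken from this chapter. It uses the usual convention that an intron is of phase 0, 1 or 2 according to how many nucleotides of a codon lie upstream of it. The function name and its arguments are invented for the example.

```python
def insertion_keeps_frame(host_intron_phase, exon_5prime_phase, exon_3prime_phase=None):
    """Check whether an exon inserted into a host intron preserves the reading frame.

    An intron of phase p interrupts a codon after p of its three
    nucleotides (p = 0, 1 or 2).

    exon_3prime_phase is None for a terminal exon block that carries its own
    stop codon and 3' untranslated sequence; only the upstream junction
    matters in that case.
    """
    for phase in (host_intron_phase, exon_5prime_phase):
        if phase not in (0, 1, 2):
            raise ValueError("intron phase must be 0, 1 or 2")
    # The upstream host exon must leave the codon where the inserted exon picks it up.
    upstream_ok = host_intron_phase == exon_5prime_phase
    if exon_3prime_phase is None:
        return upstream_ok
    if exon_3prime_phase not in (0, 1, 2):
        raise ValueError("intron phase must be 0, 1 or 2")
    # An internal exon must also hand the frame back to the downstream host exon.
    return upstream_ok and host_intron_phase == exon_3prime_phase

# Selectin-like case: a lectin exon flanked by phase 1 introns on both sides.
print(insertion_keeps_frame(1, 1, 1))
# Terminal lectin exon block with a phase 1 boundary on its NH2-terminal side.
print(insertion_keeps_frame(1, 1))
# The same terminal block dropped into a phase 0 intron would shift the frame.
print(insertion_keeps_frame(0, 1))
```

Under these assumptions an exon with matching phases at both ends can be shuffled into any intron of the same phase, whereas the terminal lectin block only needs a compatible upstream junction.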




Figure 16. The folding pattern of a lectin C-typeSF domain. Ribbon diagram of the
lectin C-typeSF domain from the rat mannose binding protein 49. The β strands are
shown as broad arrows pointing from the amino to carboxy direction, α helices as
coiled ribbons and the connecting loops as thinner lines. The labelling of the β
strands (β1-5), α helices (α1-2) and loops (L1-4) corresponds to that in the
sequence alignments in Fig. 15. The numbers 1 and 2 refer to the position of the
two holmium ions that are known to stabilize this region that contains a high
proportion of nonregular secondary structure 49.

As well as the differences in intron/exon organization there are …
patterns that differ between these two groups of lectin C-typeSF domains …
evident in the sequence alignments shown in Fig. 15. There is a characteristic …
at the NH2-terminus of the selectins E-selectin, CD62 (P-selectin) and L-
selectin, whilst in the other group there is a longer patch of sequence …
terminus that shows a conserved sequence pattern of Cys-X-X-X-X-Trp.

The structure of a lectin C-typeSF domain from a rat mannose binding protein
has recently been determined by X-ray crystallography 49 and the fold is shown in
Fig. 16. The structure is unusual in that it contains two regions, one …
contains non-regular secondary structure stabilized by two holmium ions in the
crystal structure. The other contains both β sheet and α helix; this is unusual …
domains for the other cell surface molecules consist solely of β structure …

Lectin S-type superfamily (Fig. 17)
Galactoside binding proteins have been sequenced from several species …
to contain a sequence pattern different from that of the lectin C-typeSF. …
been termed S-type because the first examples contained free …
groups 40. These are found both intracellularly and extracellularly and a …
strong sequence similarity is found in the Mac-2 leucocyte antigen …
However, analysis of protein produced by recombinant DNA techniques …
requirement for a reducing environment for lectin activity and no …
groups 50.

Figure 18. Leucine-rich glycoprotein repeats. Residues identical in six …
sequences are boxed. The asterisks mark the positions of the conserved residues
that are marked on the domain organization figures in the entries in Section II.
The sequences of the following proteins are from the Swissprot database unless
otherwise indicated and the database accession number and residue numbers are
given in brackets. CD42a, human platelet glycoprotein IX precursor (…,
56-79); CD42b, human platelet glycoprotein IB β chain precursor (P1…);
A2g-1, A2g-2 and A2g-3, human leucine-rich α2 glycoprotein (LRG) (…,
134-157, 158-181 and 182-205 respectively); PG-1, PG-2 and PG-3, human …
proteoglycan II precursor (P07585, 86-109, 110-133 and … respectively);
Chao-1 and Chao-2, Drosophila chaoptin precursor (P12024, …
respectively); CYCL-1 and CYCL-2, yeast adenylate cyclase (P08678, … and
891-913 respectively).

Leucine-rich repeats (LRR) or leucine-rich glycoprotein (LRG) repeats (Fig. 18)
The leucine-rich repeat (LRR) is characterized by a pattern of conserved residues
including about five or six leucines and some other residues in a tightly defined
repeat of 24 residues (Fig. 18). It is found both intracellularly and extracellularly in
a variety of species including Drosophila and yeast and has also been found in the
platelet glycoproteins CD42a and CD42b. It often occurs in an array of tandem
repeats. For instance there are nine repeats in the leucine-rich glycoprotein where
the repeat was first noted, 26 repeats in the yeast adenylate cyclase, 10 in the
proteoglycan protein 51 and three in the trkB protein 52. In some cases some
sequence similarity is observed beyond the alignments shown and an alternative
alignment may start from the conserved Pro position which is in the centre of the
alignment shown. The α chain of CD42b contains seven LRRs and these,
together with all the remaining coding sequence, are encoded by a single exon 53.
Thus, in this case the repeats are not each encoded by a separate exon. The LRRs
have been found in diverse proteins and they have been implicated in the specificity
of hormone binding to gonadotropin receptors 54 and in the interaction between
yeast adenylate cyclase and RAS proteins 55.

The leucine and other residues in LRRs form an amphipathic sequence which
could be involved in protein-protein or protein-lipid interactions. One of the
repeats of the Drosophila chaoptin protein has been synthesized. This peptide is
soluble in aqueous solution but will bind to phospholipid vesicles where it forms
predominantly a β structure. It has been suggested that protein segments
containing tandem repeats may also form amphipathic β sheets 56.
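
Because the repeat has a fixed length, an array of tandem repeats can be cut into 24-residue units and stacked to see which positions are conserved, in the same spirit as the boxed residues in the alignment figures. The Python sketch below is a rough illustration only. It assumes a gap-free array that starts exactly at a repeat boundary, which real sequences often do not satisfy, and the function names and the three toy repeats are invented for the example rather than taken from Fig. 18.

```python
from collections import Counter

REPEAT_LENGTH = 24  # repeat length quoted in the text

def split_repeats(region, length=REPEAT_LENGTH):
    """Cut a gap-free region into consecutive fixed-length repeats,
    discarding any incomplete repeat at the end."""
    return [region[i:i + length] for i in range(0, len(region) - length + 1, length)]

def boxed_columns(repeats, min_identical):
    """Return {0-based column: residue} for columns in which one residue
    occurs in at least min_identical of the stacked repeats."""
    boxed = {}
    for column, residues in enumerate(zip(*repeats)):
        residue, count = Counter(residues).most_common(1)[0]
        if count >= min_identical:
            boxed[column] = residue
    return boxed

# Three invented 24-residue repeats joined into one toy array.
toy_array = (
    "PSLLTGLTALRELNLSHNQLTSLP"
    "AGAFQGLGKLQTLDLSNNRIKTLP"
    "EDVFRDLPNLKYLRLDGNPLECLP"
)
repeats = split_repeats(toy_array)
for repeat in repeats:
    print(repeat)
print(boxed_columns(repeats, min_identical=3))
```

The threshold plays the role of the "identical in n or more sequences" rule used in the figure legends; lowering it boxes more columns at the cost of including chance matches.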

Link superfamily (Fig. 19) 

Two link superfamily domains were originally noted in the link protein that binds
hyaluronic acid 57. This protein also has one IgSF domain. Subsequently, a further
four linkSF domains were observed in the proteoglycan core protein that has a
chondroitin sulphate binding site. This protein also contains one IgSF domain, a
CCPSF domain and a lectin C-typeSF domain 44. There is also a single linkSF
domain in the CD44 antigen which is known to bind to hyaluronate 58,59.

Low density lipoprotein receptor (LDLR) superfamily (Fig. 20)
The LDL receptor contains seven domains of about 40 amino acids with six
conserved cysteine residues that have been called LDLRSF domains 60. The LDLR
also contains three EGFSF domains. LDLRSF domains have also been found in
other proteins, notably some complement components such as C6, C9 and factor I.
In the LDLR, four of the LDLRSF domains are each encoded by one exon whilst the
other three are encoded by a single exon 61. In the LDLR mutational analysis has
indicated that the LDLRSF domains are important in the binding of some
lipoproteins but otherwise the function of this domain type is not known 62. The
structure of the LDLRSF domain has not been determined to date.

Figure 20. (opposite) Low density lipoprotein receptor (LDLR) superfamily
domains. Residues identical in four or more sequences are boxed. The asterisks
mark the positions of the conserved residues that are marked on the domain
organization figures in the entries in Section II. The sequences of the following
proteins are from the Swissprot database and the database accession number and
residue numbers are given in brackets. LDLR, human low density lipoprotein
receptor precursor (P01130, d1 20-59, d2 61-102, d3 103-141); Comp 9, …
trout complement C9 (P06682, 72-112); Hemo. Linker, marine worm …
extracellular haemoglobin linker 2 chain (P18208, 61-102); Factor I, human
complement factor I precursor (P05156, 253-291); Comp 7, human complement
C7 precursor (P10643, 77-116); Comp 6, human complement C6 precursor
(P13671, 131-171).

Ly-6 superfamily (Fig. 21)

The Ly-6 antigens are a group of leucocyte antigens first identified in the mouse
that consist of 70-80 amino acids containing 10 Cys residues 63,64. Southern blot
analysis indicates that many Ly-6-related genes are present in the mouse and of
these, 10 distinct genes have been identified 65. The Ly-6 antigens are expressed in
non-lymphoid tissues, for example, kidney, as well as on leucocytes. Homologues
of the Ly-6 antigens have been found in the rat but not yet in …
humans the CD59 antigen has been identified as a member of the …
but seems too different in sequence from the mouse Ly-6 antigens …
homologue. CD59 is a downregulatory control protein for human …
also shows adhesion reactivity with the CD2 antigen 66. An invertebrate …
of the Ly-6 superfamily has been isolated from squid optic …
tissue 64,67. All the above molecules consist of a single Ly-6SF domain …
the cell surface by a GPI anchor.

Another member of the Ly-6 superfamily is the urokinase plasminogen activator
receptor. This molecule contains three domains separated by …
and is also attached to the cell surface by a GPI anchor. The …
superfamily are shown in Fig. 21 and a tertiary structure for …
remains to be determined. No Ly-6SF domain has been found in …
domains of any other superfamily and this may be because the …
known for this superfamily are not suited to exon shuffling (Table …

Figure 21. (opposite) Ly-6 superfamily domains. Residues identical in … or
more sequences are boxed. The asterisks mark the positions of the conserved
residues that are marked on the domain organization figures in the entries in
Section II. The sequences of the following proteins are from the Swissprot
database unless otherwise indicated and the database accession number and
residue numbers are given in brackets. CD59, human CD59 antigen (…,
26-95); Mouse Ly-6A (P05533, 27-105); Mouse Ly-6C (P09568, 2…); UPAR-1,
UPAR-2 and UPAR-3, human urokinase plasminogen activator receptor (…
S12376, 23-99, 115-199 and 214-294 respectively); Squid Sgp2, squid …
2 residues 1-92 67.

The MHC superfamily (Figs 22-24)

The MHC antigens and related molecules are members of the …
their membrane proximal domains being in the IgSF C1 set. However …
terminal segments, including the α1 and α2 domains of MHC Class I …
and the α1 and β1 domains of the Class II α and β chains, …
similarity to IgSF sequences 68 and the Class I domains are …
independent structural unit as shown in Fig. 22 34. The Class I …
show weak sequence similarity to each other and form a similar …
platform of β strands and a single α helix. The two domains together form the
peptide binding groove of the MHC molecule. In the Class II molecules the …
chain shows strong sequence similarity to Class I α1 and the Class II β1 domain
is most similar to Class I α2 69. Thus in the sequence alignments in Figs 23 and 24
the sequences are shown as an MHC Iα1 set and MHC Iα2 set. These two sets
together might be called the MHC superfamily. There are numerous sequences
related to the classical MHC antigens and these show a Class I type structural
organization, including the binding of β2-microglobulin, with no examples so far
with a Class II-like organization. The Qa and Tla antigens of mice are very similar
in sequence to MHC Class I antigens whereas the human CD1 antigens show
sequence identity only at the level of about 30%. This level of identity is also seen
for an Fc receptor of rodent neonatal gut 70 and a Class I-related molecule expressed
by cytomegalovirus 71. Secondary structure predictions have been used to suggest
that the 70 kD heat shock proteins (hsp70) may also have a peptide binding groove
like the MHC Class I antigen; however the hsp70 family of proteins show little
sequence similarity to the other members shown in Figs 23 and 24.

A more detailed discussion of MHC-related sequences can be found in refs 34,
69, 72-74.

Figure 22. The folding pattern of MHC superfamily domains. The β strands are
shown as broad arrows pointing from the amino to carboxy direction, α helices as
coiled ribbons and the connecting loops as thinner lines. The …
where the MHC Class I α1 joins the α2 domain to form the peptide binding groove
flanked by the two α helices. The data are from the Brookhaven protein database.

Figure 23. (opposite) MHC Iα1 set superfamily domains. Residues identical in
three or more of the sequences are boxed. The positions of the beta strands (β) and
alpha helices (α) determined for the structure of the human HLA Class I are
shown above the sequences. The sequences of the following proteins are from the
Swissprot database and the database accession number and residue numbers are
given in brackets. MHC Class I, human HLA Class I A-2 α precursor (P01892,
2…6); CD1A, human CD1A antigen precursor (P06126, 26-112); FcR rat, rat gut
Fc receptor precursor (P13599, 25-115); HCMV, human cytomegalovirus
glycoprotein H301 precursor (P08560, 19-101); Class II A-B α, mouse MHC Class
II A-B α chain precursor (P14434, 21-103); Class II DQ (3) α, human MHC Class
II DQ (3) α chain precursor (P01909, 28-109).

Figure 24. (opposite) MHC Iα2 set superfamily domains. Residues identical in
three or more of the sequences are boxed. The positions of the beta strands (β) and
alpha helices (α) determined for the structure of the human HLA Class I are
shown above the sequences. The sequences of the following proteins are from the
Swissprot database and the database accession number and residue numbers are
given in brackets. MHC Class I, human HLA Class I A-2 α precursor (P01892,
115-203); CD1A, human CD1A antigen precursor (P06126, 109-199); FcR rat, rat
gut Fc receptor precursor (P13599, 110-199); HCMV, human cytomegalovirus
glycoprotein H301 precursor (P08560, 112-210); Class II A β, mouse H2 Class II …
β chain precursor (P14483, 32-122); Class II DQ (3) β, human MHC Class II DQ …
chain precursor (P06126, 109-199).

Nerve growth factor receptor (NGFR) superfamily (Fig. 25) 

Four cysteine-rich repeats were recognized in the extracellular part of the low
affinity nerve growth factor receptor (NGFR) and subsequently related sequence
repeats have been identified in a number of leucocyte cell surface antigens
including CD40, MRC OX-40, CD27 and the tumour necrosis factor receptors.
Figure 25 shows an alignment of some of the repeats. This repeat is unusual in that
most of the NGFRSF molecules contain 3 or 4 repeats. No molecule with a single
NGFRSF repeat has been found and the repeat has not been associated with any
other domain types. The gene structure of the NGFR shows that the repeat is not
coded for by a single exon 75 so it is unlikely that this repeat arose by gene
duplication of exons encoding single repeats. A primordial gene with four repeats
may have evolved by unequal crossing-over during recombination and this gene
probably gave rise to all known members of the NGFR superfamily by duplication
and divergence.

The rhodopsin superfamily (Fig. 26) 

The members of this large superfamily of more than 50 proteins are characterized
by the presence of seven hydrophobic membrane-spanning sequences and are
reviewed in refs 76, 77. The proteins are oriented with the NH2-terminus on the
extracellular side and the COOH-terminus on the cytoplasmic side of the plasma
membrane. Several names have been used to describe this superfamily, such as G-
protein coupled receptor superfamily, 7TMS (7-transmembrane) and rhodopsin
superfamily. We have chosen the term rhodopsin superfamily as rhodopsin was the
first and best characterized member of this superfamily and the name does not imply
any functional association which might later be shown to be inappropriate. The
sequence conservation is highest in the potential transmembrane segments, with
most diversity in the NH2- and COOH-termini and the cytoplasmic loop between
segments 5 and 6. Most members of the rhodopsinSF have been shown to couple to
various G-proteins. Experiments using chimeric proteins have shown that the
sequences contributing to G-protein attachment are found in transmembrane
segments 5 and 6 and the cytoplasmic loop between them. A subset of closely
related rhodopsinSF members is found on leucocytes and includes the C5aR,
fMLPR and the IL8R (see the entries in Section II and Fig. 26).
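
Hydrophobic membrane-spanning stretches of this kind are commonly looked for with a sliding-window hydropathy plot. The sketch below is one possible illustration and not the procedure used for the entries in this book. It assumes the Kyte-Doolittle hydropathy scale, a 19-residue window and a cutoff of 1.6, all of which are conventional choices brought in for the example, and the toy sequence is invented.

```python
# Kyte-Doolittle hydropathy values (an assumption of this example).
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

def hydropathy_profile(sequence, window=19):
    """Mean hydropathy of every full window along the sequence."""
    scores = [KD[residue] for residue in sequence]
    return [sum(scores[i:i + window]) / window
            for i in range(len(sequence) - window + 1)]

def candidate_segments(sequence, window=19, threshold=1.6):
    """Return 1-based (start, end) residue ranges covered by each run of
    consecutive windows whose mean hydropathy reaches the threshold.
    A range spans every residue in the run, so it can be wider than the
    hydrophobic stretch itself."""
    profile = hydropathy_profile(sequence, window)
    segments = []
    run_start = None
    for i, value in enumerate(profile):
        if value >= threshold and run_start is None:
            run_start = i
        elif value < threshold and run_start is not None:
            segments.append((run_start + 1, i - 1 + window))
            run_start = None
    if run_start is not None:
        segments.append((run_start + 1, len(profile) - 1 + window))
    return segments

# Invented toy sequence: seven hydrophobic stretches joined by polar linkers.
hydrophobic = "LIVALLAVLIVALLAVLIVAL"
linker = "KDESGNRQDEKSGNR"
toy = linker + (hydrophobic + linker) * 7

segments = candidate_segments(toy)
print(len(segments), "candidate membrane-spanning segments")
for start, end in segments:
    print(start, end)
```

A count of seven candidate segments is suggestive only; hydropathy alone cannot establish membership of the superfamily, which rests on the sequence similarities shown in Fig. 26.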

The scavenger receptor (scavengerR) superfamily (Fig. 27) 

Three domains with sequence similarities were identified in the extracellular
region of the CD5 antigen and later these domains were detected in macrophage
scavenger receptors, the complement control protein factor I, the CD6 antigen and
the speract receptor protein present in sea urchins 78. We call this the scavengerR
superfamily since these molecules are the first of this superfamily with which a
clear functional activity has been associated. However, some ligands will bind to
both scavenger receptors I and II but the latter lacks this type of domain so the
functional involvement of this domain remains to be resolved 79. Initially it was
argued that the CD5 antigen domains were related to IgSF domains 80. However,
this contention was not supported by ALIGN analysis as described above.
Subsequently it was suggested that the CD5 domains were related in sequence to
the domains of the PapD bacterial protein 11 but again this was not supported by a
detailed analysis including CD5 domains plus numerous other scavengerRSF
domains. It is now clear that there is a separate superfamily of proteins containing
scavengerRSF domains and alignments for this superfamily are shown in Fig. 27.
No tertiary structure data are available yet for these domains.

Signal transduction sequence motifs {Fig. 28) 

The signal transduction sequence motif shown in alignments in Fig. 28 is present 
in the cytoplasmic regions of several membrane proteins present in the antigen 
receptor complexes on B cells, T cells, and the IgE receptor on mast cells **. This 
motif is also found in the CDS antigen cytoplasmic domain (Beyers, Spruyt and 
Williams, unpublished). The CD3 C chain is unusual in that it has three motifs 
whereas the others have only one. One common feature of these molecules is that 
they are components of membrane complexes \ which, when cross-linked, give 
signals that lead to cell activation. This results in cell proliferation in the case of 
the antigen receptors and to degranulation of mast cells. Cross-linking of CD3e by 
Figure 28. Signal transduction motifs. Residues identical in five or more sequences 
are boxed. The sequences of the following proteins are from the Swissprot 
database and the database accession number and residue numbers are given in 
brackets. CD3 gamma, human CD3 γ chain precursor (P09693, 149-181); CD3 
delta, human CD3 δ chain precursor (P04234, 138-171); CD3 epsilon, mouse CD3 
ε chain precursor (P22646, 159-184); CD3 zeta, human CD3 ζ chain precursor 
(P20963, 61-94, 99-130, 131-163); MB1, mouse MB-1 protein precursor (P11911, 
171-204); B29, mouse B cell glycoprotein B29 precursor (P15530, 184-217); Fc 
epsilon R beta, rat Ig ε receptor β subunit (P13386, 207-239); Fc epsilon R gamma, 
rat Ig ε receptor γ subunit precursor (P20411, 54-86); CD5, human CD5 antigen 
precursor (P06127, 442-475). 
Figure 29. Somatomedin B superfamily domains. Residues identical [...] 
more sequences are boxed. The asterisks mark the positions of the [...] 
residues that are marked on the domain organization figures in the [...] 
Section II. The sequences of the following proteins are from the Swissprot 
database and the database accession number and residue numbers are given in 
brackets. PC-1, mouse plasma cell antigen PC-1 (P06802; d1, 54-93; d[...] 
PP11, human placental protein precursor (P21128, 47-88); Vitronectin, [...] 
vitronectin precursor (P04004; 22-63). 
immobilized mAbs leads to T cell proliferation. The use of chimeric [...] 
shown that the cross-linking of the cytoplasmic domains of the T[...] 
chain or the CD3ε, gives a TcR-like signal in cells lacking surface expression of 
TcR, implying that this motif is involved in coupling the TcR to signal 
transduction mechanisms 82,83. 
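
The boxing rule used in the caption of Fig. 28 (residues identical in five or more sequences) can be illustrated with a short sketch. This is only an assumed reading of the rule, not the procedure actually used to prepare the figures: the function name `conserved_columns`, the gap symbol and the toy alignment are all hypothetical, and the toy strings are made-up residues, not the sequences of Fig. 28.

```python
from collections import Counter


def conserved_columns(aligned, min_count, gap="-"):
    """Return (column index, residue) pairs for alignment columns in which
    a single residue occurs in at least `min_count` sequences.

    `aligned` is a list of equal-length strings (one per aligned sequence);
    gap characters are ignored when counting.
    """
    if len({len(seq) for seq in aligned}) != 1:
        raise ValueError("aligned sequences must all have the same length")
    hits = []
    for index, column in enumerate(zip(*aligned)):
        counts = Counter(res for res in column if res != gap)
        if not counts:
            continue  # column consists only of gaps
        residue, n = counts.most_common(1)[0]
        if n >= min_count:
            hits.append((index, residue))
    return hits


# Hypothetical toy alignment of six sequences (illustration only).
toy = ["ACDYE", "ACEYE", "GCDYQ", "ACDFE", "A-DYE", "TCDYE"]
print(conserved_columns(toy, min_count=5))
```

With a threshold of five, columns 1 to 4 of the toy alignment would be boxed and column 0 would not, because its most common residue occurs only four times.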

Somatomedin B superfamily (Fig. 29) 

Somatomedin B is a serum peptide derived from vitronectin (also [...] 
spreading factor) by proteolysis. The plasma cell surface antigen PC-1 [...] 
contain two somatomedin BSF repeats 45. This glycoprotein [...] 
pyrophosphatase/alkaline phosphodiesterase activity 84 but this is [...] 
a different region of the molecule than that containing the somatomedin [...] 
repeat. The domain has not been found on other cell surface molecules [...] 
is present in placental protein 11 (PP11) 86. 

Transmembrane 4 pass (TM4) superfamily and the relationship between FcεRI β 
chain and CD20 (Figs 30 and 31) 

The "TM4 superfamily" is a term that we suggest for a new group [...] 
clear sequence similarities that are thought to traverse the lipid bilayer [...] 
with both the NH2- and COOH-termini on the cytoplasmic face of [...] 
This superfamily includes several leucocyte antigens such as CD[...] 
CD63 and TAPA-1. Alignments for the TM4 superfamily are shown [...] 
genomic sequence of the TAPA-1 antigen shows that the sequence [...] 
eight exons which do not indicate any simple correlation with [...] 
transmembrane sequences 86. The intron/exon boundaries of [...] 
largely compatible with those of TAPA-1 87 in support of the argument [...] 
molecules had a common ancestor in evolution. The majority of the [...] 
sequence between TM4 superfamily molecules reside in the [...] 
between TM sequences 3 and 4 where there are considerable [...] 
sequence length. This loop of sequence is known to be extracellular [...] 
includes the N-linked glycosylation sites and the MRC OX-44 epitope [...] 
labelled at the cell surface maps to an Ile/Thr interchange in [...] 
addition, surface labelling studies on TAPA-1 support an extracellular [...] 
Figure 30. (opposite) Transmembrane 4 [...] 
in four or more sequences are boxed. [...] 
from the Swissprot database or the [...] 
number is given in brackets (the [...] 
Schistosoma mansoni protein Sm23 ([...] 
antigen ME491 (P08962); CD9, human [...] 
TAPA-1 antigen (P18582); Co029, human [...] 
(P19075); R2, human R2 antigen [...] 
human CD53 antigen (P19397). 

Figure 31. (below) Alignment of CD20 [...] 
between the two sequences are boxed [...] 
transmembrane sequences are indicated [...] 
The similarities are mostly within or [...] 
fall off towards the COOH-terminus, [...] 
and an ALIGN score of 6.1 SD was obtained [...] 
Swissprot database accession numbers [...] 
chain (P20490). 
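
The caption of Fig. 31 quotes an ALIGN score in SD units. The general idea behind such a score, comparing the real alignment score with the scores obtained after shuffling the sequences, can be sketched as follows. This is a simplified illustration under stated assumptions, not the ALIGN program itself: it swaps in plain identity scoring with a linear gap penalty where ALIGN uses a mutation data matrix, and the function names, parameter values and test sequence are hypothetical.

```python
import random
import statistics


def global_score(a, b, match=1, mismatch=0, gap=-1):
    """Needleman-Wunsch global alignment score with a linear gap penalty.

    Assumption: simple identity scoring, not a mutation data matrix.
    """
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cur.append(max(diag, prev[j] + gap, cur[j - 1] + gap))
        prev = cur
    return prev[-1]


def score_in_sd(a, b, shuffles=100, seed=0):
    """Express the real score as standard deviations above the mean score
    of randomly shuffled versions of the same two sequences."""
    rng = random.Random(seed)
    real = global_score(a, b)
    background = []
    for _ in range(shuffles):
        shuffled_a = rng.sample(a, len(a))  # permutation, same composition
        shuffled_b = rng.sample(b, len(b))
        background.append(global_score(shuffled_a, shuffled_b))
    sd = statistics.pstdev(background)
    if sd == 0:
        raise ValueError("shuffled scores show no variation")
    return (real - statistics.mean(background)) / sd


# Hypothetical example: a sequence compared with itself.
example = "ACDEFGHIKLMNPQRSTVWY"
print(score_in_sd(example, example))
```

Shuffling preserves the length and composition of each sequence, so a high value indicates that the match depends on the order of the residues rather than on composition alone.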
could be referred [...] a Ca2+ channel [...] and it would [...] to show [...] CD20 is [...] 
FcεRI β chain and [...] 

The tyrosine kinase superfamily (Fig. 32) 

Tyrosine kinase domains are found in the cytoplasmic [...] groups can be 
distinguished: receptor tyrosine kinases, which are transmembrane proteins, and 
non-receptor tyrosine kinases, which are located in the cytoplasm. The non-receptor 
group of kinases [...] they phosphorylate Tyr [...] early events in 
signal transduction pathways after ligand [...] In leucocytes the best 
studied example is p56lck, which [...] cytoplasmic domains of CD4 
and CD8 and regulates [...] Other examples 
are fyn, which [...] couple [...] natural ligands they [...] domains 
become [...] phosphorylation and 
[...] phosphatidylinositol [...] 